
Fix CellPipe topology FQCN naming#4835

Open
nvidianz wants to merge 15 commits into NVIDIA:main from nvidianz:nvbugs-6371056-cellpipe-topology-name

Conversation

@nvidianz

@nvidianz nvidianz commented Jun 26, 2026

Collaborator

Summary

  • Change CellPipe FQCN construction so the FQCN parent matches the actual connected cell topology, using names such as site-1.<jobid>_active instead of site-1.<jobid>.active.
  • Add routing coverage for the NVBug 6371056 streaming download path from a subprocess CellPipe cell to server.<jobid> through the client CP.
  • Update mTLS identity and stream auth coverage for topology-shaped direct and relay CellPipe names.

Root Cause

PR #4801 changed CellPipe cells to hierarchical names such as site-1.<jobid>.active. The subprocess cell connects physically to site-1, but its FQCN parent became site-1.<jobid>, which is not connected. Cross-family routing to server.<jobid> then fell back to the missing FQCN parent and returned TARGET_UNREACHABLE.
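
The parent mismatch described above can be illustrated with a minimal sketch. It assumes that an FQCN parent is simply the name with its last dot-separated segment removed; `fqcn_parent` and `connected_cells` are hypothetical names for illustration, not NVFlare APIs.

```python
def fqcn_parent(fqcn: str) -> str:
    """Return the FQCN minus its last dot-separated segment ('' if there is none)."""
    head, sep, _leaf = fqcn.rpartition(".")
    return head if sep else ""


# Toy topology: the only cells that actually hold a connection (assumption).
connected_cells = {"server", "site-1"}

old_name = "site-1.job-123.active"  # hierarchical form introduced by PR #4801
new_name = "site-1.job-123_active"  # single leaf segment used by this PR

for name in (old_name, new_name):
    parent = fqcn_parent(name)
    print(f"{name}: parent={parent!r}, connected={parent in connected_cells}")
```

With the old name the derived parent is `site-1.job-123`, which no cell is connected to; with the new name it is `site-1`, the cell the subprocess physically connects to.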

Validation

  • ~/nvflare-env/3.12/bin/python -m pytest tests/unit_test/fuel/utils/pipe/cell_pipe_test.py tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py tests/unit_test/fuel/f3/cellnet/identity_binding_test.py tests/unit_test/private/fed/authenticator_test.py -q
  • ~/nvflare-env/3.12/bin/python -m black --check ... on touched files
  • ~/nvflare-env/3.12/bin/python -m isort --check-only ... on touched files
  • ~/nvflare-env/3.12/bin/python -m flake8 ... on touched files

Note: repo-wide PATH="$HOME/nvflare-env/3.12/bin:$PATH" ./runtest.sh -s was attempted, but failed on unrelated checked-in examples/.../node_modules/.../flatted.py files that black wants to reformat.

NVBug: 6371056

Also included (separable)

  • ccwf: ClientSideController._do_learn now records ReturnCode.EXECUTION_EXCEPTION when do_learn_task raises, so the job ends with an error status instead of FINISHED:COMPLETED. This also fixes the broken self.logger.log(msg) call in that handler (missing level argument).
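
The error-recording pattern described in this bullet can be sketched as follows. This is not the NVFlare implementation: `ToyClientController`, its methods, and the reduced `ReturnCode` enum are hypothetical stand-ins that only mirror the names used above.

```python
import logging
from enum import Enum


class ReturnCode(str, Enum):
    # Stand-in for NVFlare's ReturnCode; only the values needed here.
    OK = "OK"
    EXECUTION_EXCEPTION = "EXECUTION_EXCEPTION"


class ToyClientController:
    """Hypothetical stand-in for ClientSideController, reduced to the error path."""

    def __init__(self, learn_task):
        self.logger = logging.getLogger("toy_ccwf")
        self.learn_task = learn_task
        self.last_error = ReturnCode.OK

    def update_status(self, error: ReturnCode) -> None:
        self.last_error = error

    def get_status_report(self) -> dict:
        return {"error": self.last_error.value}

    def do_learn(self) -> None:
        try:
            self.learn_task()
        except Exception:
            # Logger.log() requires a level as its first argument;
            # Logger.exception() logs at ERROR and appends the traceback.
            self.logger.exception("exception from do_learn_task")
            # Record the failure so the next status report carries it.
            self.update_status(error=ReturnCode.EXECUTION_EXCEPTION)


def failing_task():
    raise RuntimeError("boom")


ctl = ToyClientController(failing_task)
ctl.do_learn()
print(ctl.get_status_report())
```

The point of the sketch is that logging alone leaves the status report clean; the failure only becomes visible to a consumer of the report once it is recorded.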

@nvidianz nvidianz marked this pull request as ready for review June 26, 2026 17:36
@codecov-commenter

codecov-commenter commented Jun 26, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.05%. Comparing base (57c403c) to head (100e8e8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4835      +/-   ##
==========================================
+ Coverage   56.97%   57.05%   +0.07%     
==========================================
  Files         969      969              
  Lines       92285    92322      +37     
==========================================
+ Hits        52578    52671      +93     
+ Misses      39707    39651      -56     
Flag Coverage Δ
unit-tests 57.05% <100.00%> (+0.07%) ⬆️


@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Contributor

Greptile Summary

This PR fixes CellPipe FQCN naming so that a pipe cell's FQCN parent matches the cell it physically connects to, resolving TARGET_UNREACHABLE routing failures (NVBug 6371056) introduced by PR #4801's hierarchical naming scheme.

  • Core naming change: _cell_fqcn now emits a single prefixed leaf segment (cellpipe-<token>_<mode> for CP/root connections; cellpipe-alias-<owner>_<token>_<mode> for relay connections) instead of the intermediate hierarchical form <site>.<token>.<mode> that created unconnected FQCN parents.
  • Identity resolution & stream auth: identity.py and authenticator.py updated to gate legacy bare-alias parsing to single-segment FQCNs and use the new parse_cell_pipe_alias grammar, preventing fabricated alias owners from underscore-bearing tokens.
  • _do_learn fix: ClientSideController._do_learn now records ReturnCode.EXECUTION_EXCEPTION on do_learn_task failure so the job ends with an error status instead of FINISHED:COMPLETED, and fixes the broken self.logger.log(msg) call.

Confidence Score: 5/5

Safe to merge; the fix is self-contained, well-tested, and the unreleased hierarchical naming scheme (#4801) carries no backward-compatibility burden.

The topology naming change is logically correct: each connection scenario (root, own CP, relay, CP-behind-relay, missing parent) is covered by dedicated parametrized tests verifying both happy paths and all ValueError cases. The parse_cell_pipe_alias grammar shared across naming, identity resolution, and stream auth is tested with round-trip, legacy-bare, right-anchored, and rejection cases. Identity resolution and _origin_matches_fqcn handle the new prefix correctly, including the subtle parent-equality guard for relay aliases. The _do_learn bugfix is small and independently tested. No known-good behavior is regressed.

No files require special attention. nvflare/private/fed/authenticator.py contains the most nuanced logic (parent-equality guard + alias-marker gate for stream auth), but the new rejection tests cover the security-relevant paths.

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/fuel/utils/pipe/cell_pipe.py | Core naming fix: _cell_fqcn rewritten to emit topology-shaped names with one leaf segment; adds input validation (empty token, alias- prefix, dot/underscore restrictions per parent type) with clear ValueErrors. |
| nvflare/fuel/f3/cellnet/fqcn.py | New make_cell_pipe_alias / parse_cell_pipe_alias grammar shared by naming, identity resolution, and stream auth; right-anchored parsing prevents owner fabrication from underscored tokens; legacy bare-alias format still accepted for single-segment FQCNs. |
| nvflare/fuel/f3/cellnet/identity.py | Identity resolution now correctly gates bare-alias parsing to cellpipe-alias- leaves or single-segment FQCNs, and resolves plain cellpipe- leaves to their parent's identity; prevents fabricated owners from underscore tokens. |
| nvflare/private/fed/authenticator.py | _origin_matches_fqcn rewritten: relay-alias stream origins now require matching FQCN parent AND explicit cellpipe-alias- marker at depth; legacy flat aliases still accepted for root-level single-segment origins. |
| nvflare/app_common/ccwf/client_ctl.py | Bugfix: exception from do_learn_task now records ReturnCode.EXECUTION_EXCEPTION via update_status and uses log_exception (replaces the broken self.logger.log(msg) call missing the level argument). |
| nvflare/fuel/f3/cellnet/core_cell.py | Comment-only update: updates the inline documentation of the root-connector fall-through in _try_find_ep to reflect the new naming scheme and cites the covering test. |
| tests/unit_test/fuel/f3/cellnet/fqcn_test.py | New test file with round-trip, legacy-bare, right-anchored-parsing, and rejection tests for the parse_cell_pipe_alias grammar; also covers the caller-must-gate ambiguity for plain leaves with underscore tokens. |
| tests/unit_test/fuel/utils/pipe/cell_pipe_test.py | Expanded to cover all four parent types, missing parent fallback, and all validation error cases (empty token, alias- prefix, dot/underscore restrictions); previously untested edge cases now have parametrized coverage. |
| tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py | Test FQCNs updated to topology naming; two new tests cover server-job routing through a connected CP and relay-alias pipes routing to a server job cell through their relay. |
| tests/unit_test/fuel/f3/cellnet/identity_binding_test.py | Updated for topology naming; adds tests for underscore-token plain leaves, CP resolving its own pipe child, relay-alias cell from a distant resolver, and the explicit alias marker at any depth. |
| tests/unit_test/private/fed/authenticator_test.py | New tests cover relay-alias stream auth acceptance and rejection (different relay parent, unmarked alias shape at depth); tests confirm the two-part parent+owner check prevents spoofing. |
| tests/unit_test/app_common/ccwf/client_ctl_test.py | New test file verifying that _do_learn logs the exception and records EXECUTION_EXCEPTION in the status report when do_learn_task raises. |
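
The right-anchored alias grammar mentioned for fqcn.py can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the PR: `make_alias` and `parse_alias` are hypothetical names, only the marked `cellpipe-alias-<owner>_<token>_<mode>` form is handled, and the legacy bare-alias form is omitted.

```python
from typing import Optional, Tuple

ALIAS_PREFIX = "cellpipe-alias-"
MODES = ("active", "passive")


def make_alias(owner: str, token: str, mode: str) -> str:
    """Build cellpipe-alias-<owner>_<token>_<mode>; the token may not contain '_' or '.'."""
    if mode not in MODES:
        raise ValueError(f"bad mode: {mode!r}")
    if not token or "_" in token or "." in token:
        raise ValueError(f"bad alias token: {token!r}")
    if not owner or "." in owner:
        raise ValueError(f"bad alias owner: {owner!r}")
    return f"{ALIAS_PREFIX}{owner}_{token}_{mode}"


def parse_alias(leaf: str) -> Optional[Tuple[str, str, str]]:
    """Return (owner, token, mode), or None if the leaf is not a marked alias."""
    if not leaf.startswith(ALIAS_PREFIX):
        return None
    body = leaf[len(ALIAS_PREFIX):]
    # Split from the right: mode and token are the last two fields,
    # so any '_' left over belongs to the owner name.
    parts = body.rsplit("_", 2)
    if len(parts) != 3:
        return None
    owner, token, mode = parts
    if not owner or not token or mode not in MODES:
        return None
    return owner, token, mode


print(parse_alias(make_alias("site_1", "job123", "active")))
print(parse_alias("cellpipe-ext_trainer_active"))
```

Because tokens are forbidden from containing `_` on the alias path, splitting from the right is unambiguous even when the owner name itself contains underscores. A plain `cellpipe-` leaf is never alias-parsed in this sketch.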

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_cell_fqcn(mode, site, token, parent_fqcn)"] --> V1{token empty?}
    V1 -->|yes| E1["ValueError: token must be non-empty"]
    V1 -->|no| V2{token starts with 'alias-'?}
    V2 -->|yes| E2["ValueError: 'alias-' prefix reserved"]
    V2 -->|no| LEAF["cell_name = cellpipe-{token}_{mode}"]
    LEAF --> P1{parent_fqcn == ROOT_SERVER?}
    P1 -->|yes| R1["prefix = site_name\n(dotted token allowed; routes via root fall-through)"]
    P1 -->|no| P2{parent_fqcn ends with site_name?}
    P2 -->|yes| V3{'.' in token?}
    V3 -->|yes| E3["ValueError: '.' splits FQCN segments"]
    V3 -->|no| R2["prefix = parent_fqcn\n(plain leaf under own CP)"]
    P2 -->|no| P3{parent_fqcn non-empty?}
    P3 -->|yes| V4{"'_' or '.' in token?"}
    V4 -->|yes| E4["ValueError: must not contain '_' or '.'"]
    V4 -->|no| R3["prefix = parent_fqcn\ncell_name = cellpipe-alias-{site}_{token}_{mode}"]
    P3 -->|no| R4["prefix = site_name\n(fallback: warn + root routing)"]
    R1 --> OUT["FQCN.join([prefix, cell_name])"]
    R2 --> OUT
    R3 --> OUT
    R4 --> OUT
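
The decision flow in the flowchart can be read as the following Python sketch. It is a hypothetical reimplementation for illustration, not the code in cell_pipe.py: `cell_fqcn_sketch` is an invented name, `ROOT_SERVER = "server"` is an assumed value for FQCN.ROOT_SERVER, and "parent_fqcn ends with site_name" is interpreted here as "the last segment of parent_fqcn equals site_name".

```python
import logging

ROOT_SERVER = "server"  # assumed value of FQCN.ROOT_SERVER
log = logging.getLogger("cell_pipe_sketch")


def cell_fqcn_sketch(mode: str, site_name: str, token: str, parent_fqcn: str = "") -> str:
    parent_fqcn = parent_fqcn or ""
    if not token:
        raise ValueError("token must be non-empty")
    if token.startswith("alias-"):
        raise ValueError("the 'alias-' prefix is reserved")

    cell_name = f"cellpipe-{token}_{mode}"
    if parent_fqcn == ROOT_SERVER:
        # Root connection: dotted tokens are allowed; routing uses the root fall-through.
        prefix = site_name
    elif parent_fqcn and parent_fqcn.split(".")[-1] == site_name:
        # Plain leaf under the site's own CP: a '.' would add extra FQCN segments.
        if "." in token:
            raise ValueError("token must not contain '.'")
        prefix = parent_fqcn
    elif parent_fqcn:
        # Connected through another cell (e.g. a relay): emit the alias form.
        if "_" in token or "." in token:
            raise ValueError("token must not contain '_' or '.'")
        prefix = parent_fqcn
        cell_name = f"cellpipe-alias-{site_name}_{token}_{mode}"
    else:
        # No parent FQCN in the conn props: warn and fall back to root routing.
        log.warning("no parent FQCN; falling back to site name %s", site_name)
        prefix = site_name
    return f"{prefix}.{cell_name}"


print(cell_fqcn_sketch("active", "site-1", "job123", "site-1"))
print(cell_fqcn_sketch("passive", "site-1", "job123", "relay-1"))
print(cell_fqcn_sketch("active", "site-1", "simulate_job", ROOT_SERVER))
```

Under these assumptions the three calls yield `site-1.cellpipe-job123_active`, `relay-1.cellpipe-alias-site-1_job123_passive`, and `site-1.cellpipe-simulate_job_active`.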

Reviews (14): Last reviewed commit: "Use explicit cellpipe- and cellpipe-alia..."

Comment thread nvflare/fuel/utils/pipe/cell_pipe.py
Comment thread tests/unit_test/fuel/utils/pipe/cell_pipe_test.py Outdated

@chesterxgchen chesterxgchen left a comment

Collaborator


What changed:

  • CellPipe FQCN naming changed from hierarchical runtime segments like:

site-1.job-123.active

to topology-aligned leaf names like:

site-1.job-123_active

See /private/tmp/nvflare-pr4835/nvflare/fuel/utils/pipe/cell_pipe.py:50. This makes the FQCN parent site-1, which is the cell the
subprocess actually connects to, instead of the unconnected pseudo-parent site-1.job-123.

  • Routing tests now cover the reported case: a pipe cell named site-1.job-123_active, connected only to site-1, can route to
    server.job-123 through site-1. See /private/tmp/nvflare-pr4835/tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py:46.

  • Identity/auth code was adjusted so these alias-style CellPipe stream cells still authenticate as the owning site, but only on the
    stream channel and under the same parent. See /private/tmp/nvflare-pr4835/nvflare/private/fed/authenticator.py:306.

  • It does not generally change CoreCell._try_find_ep() routing behavior, except comments. The fix is mainly “name the CellPipe cell so
    the existing parent fallback works.”

Validation I ran on PR 4835:

90 passed

for:

tests/unit_test/fuel/utils/pipe/cell_pipe_test.py
tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py
tests/unit_test/fuel/f3/cellnet/identity_binding_test.py
tests/unit_test/private/fed/authenticator_test.py

Does it fix the reported problem?

Likely yes for the primary failure:

ext-process + streaming -> target_unreachable on server.

The new naming restores the parent lookup path described in your root cause. For site-1.job-123_active, the parent fallback is site-1,
and that is the connected client cell, so routing to server.<jobid> should no longer die at TARGET_UNREACHABLE.

What it does not fix:

  • The secondary logger bug is still there:

self.logger.log(f"exception from do_learn_task: ...")

in /private/tmp/nvflare-pr4835/nvflare/app_common/ccwf/client_ctl.py:464. That still should be self.logger.error(...).

  • The “job ends FINISHED:COMPLETED after TASK_ABORTED” behavior is not addressed by this PR. If the CellPipe routing fix prevents the
    abort, that symptom disappears for this case, but the controller semantics are unchanged for any future abort path.

My read: PR 4835 fixes the actual stream-routing regression, but it is incomplete relative to the full issue report because it leaves
the logger typo and abort/status behavior untouched. It also has unit coverage, not a full POC ext-process streaming regression test.

@nvidianz nvidianz requested a review from chesterxgchen June 26, 2026 19:03
YuanTingHsieh
YuanTingHsieh previously approved these changes Jun 27, 2026

@chesterxgchen chesterxgchen left a comment

Collaborator


@nvidianz before you merge, can you address my comments

@nvidianz

nvidianz commented Jul 1, 2026

Collaborator Author

@chesterxgchen Thanks for the detailed review — to your points (updated: all three are now addressed in this PR):

  1. client_ctl.py logger bug — fixed. The snapshot you reviewed was the first commit (599e35a); b0a5145 changed self.logger.log(...) to self.logger.error(...) in _do_learn, and 46fccda added a unit test (test_do_learn_logs_exception_from_learn_task) that verifies the exception is logged and the learn task is cleared.

  2. FINISHED:COMPLETED after TASK_ABORTED: now addressed at the ccwf level in 8837e2c. _do_learn previously swallowed exceptions from do_learn_task without recording them, so the server controller never saw the failure and the job completed "normally". It now records ReturnCode.EXECUTION_EXCEPTION in the client status; the server controller's existing handling (system_panic on error reports → FATAL_SYSTEM_ERROR → UPDATE_RUN_STATUS(execution_error=True)) then ends the job as FINISHED:EXECUTION_EXCEPTION. Executor-returned failure rc's (including TASK_ABORTED) were already reported by swarm_client_ctl; only this exception path was silent. One remaining hardening step, pushing error reports immediately via aux message instead of piggybacking on task pulls, is planned as a separate small PR.

  3. Coverage: 44fa08d adds an ext_process_streaming integration test group so the exact regression path (subprocess CellPipe cell → server.<jobid> streaming/download) can be run on its own. I ran it against this branch: both jobs passed (POC, 1 server / 2 clients), all 8 MB download transactions finished, and no TARGET_UNREACHABLE occurred during execution. Full results are in the validation comment below.

Please take another look when you get a chance.

@nvidianz nvidianz requested a review from chesterxgchen July 1, 2026 20:42
An exception from do_learn_task was only logged; the client never set the
error in its status report, so the ccwf server controller never saw the
failure and the job ended FINISHED:COMPLETED. Record the error so the next
status report triggers system_panic on the server, ending the job as
FINISHED:EXECUTION_EXCEPTION.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz

nvidianz commented Jul 1, 2026

Collaborator Author

Follow-up on the two remaining review points — both are now part of this PR:

1. FINISHED:COMPLETED after TASK_ABORTED: addressed at the ccwf level in 8837e2c. ClientSideController._do_learn now records ReturnCode.EXECUTION_EXCEPTION in its status when do_learn_task raises (previously the exception was only logged). The next status report makes the server controller system_panic, which flows through FATAL_SYSTEM_ERROR → UPDATE_RUN_STATUS(execution_error=True) → job ends FINISHED:EXECUTION_EXCEPTION. The unit test asserts both the recorded error and that _get_status_report() delivers it. (Note: executor-returned failure rc's were already reported by swarm_client_ctl; only the exception path was silent.)

2. Ext-process streaming regression coverage: 44fa08d adds an ext_process_streaming group to the integration test configs, grouping the two POC test cases that exercise this exact path so it can be run standalone (NVFLARE_TEST_FRAMEWORK=ext_process_streaming pytest system_test.py):

  • np-loop-cell-pipe (ClientAPILauncherExecutor + SubprocessLauncher + CellPipe)
  • pt-large-model-pass-through (8 MB model forces the download-service route: subprocess CellPipe cell ↔ server.<jobid>)

Ran on this branch (1 server, 2 clients): 1 passed in 99.72s (both jobs), all server download txs status=finished (3 × 8,439,348 bytes for pass-through), both client worker processes exited RC 0, and no TARGET_UNREACHABLE / cannot forward during job execution. The only stream errors in the log are post-job teardown ACK noise (sm__ACK to the already-exited subprocess cell), which is a separate pre-existing race being fixed under NVBug 6389772.

Remaining follow-up (separate PR): push error status reports to the server immediately via aux message instead of piggybacking on task pulls, closing the end-of-run race window where a recorded error might not be delivered.

Groups the two existing ext-process Client API test cases (np_loop_cell_pipe
and pt_large_model_pass_through) so the subprocess CellPipe -> server.<jobid>
streaming routing path broken by the FQCN naming regression can be exercised
on its own as a regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen

chesterxgchen commented Jul 1, 2026

Collaborator

Review: PR #4835 — Fix CellPipe topology FQCN naming

The PR fixes NVBug 6371056 by collapsing the CellPipe token+mode into a single leaf segment (site-1.<jobid>_active instead of site-1.<jobid>.active), so the FQCN parent matches the physically connected cell and cross-family routing to server.<jobid> no longer falls back to an unconnected parent. It rewrites the stream-alias check in authenticator.py to a parent-equality + owner-leaf rule, and updates routing/identity/auth test coverage accordingly. The core fix is sound: the two high-severity candidates my finders raised (peer-FQCN disagreement between pipe ends, relay+VIA_ROOT auth rejection) were both refuted on verification. Pipe ends provably compute names from identical exported conn props, and the VIA_ROOT scenario is pre-existing on main and unreachable via ScriptRunner.

Surviving findings (most severe first)

  1. nvflare/fuel/utils/pipe/cell_pipe.py:66 - User-controlled tokens containing _ or . break the now load-bearing alias grammar, with no validation at construction (PLAUSIBLE). CellPipe.__init__ only does check_str(token), and FlareAgentWithCellPipe(agent_id=...) passes a free-form user string straight through. Behind a relay, token="my_token" yields leaf site-1_my_token_active: identity.py parses owner site-1_my → mTLS require_match closes the connection, and authenticator.[…] _ → stream messages UNAUTHENTICATED. This worked on main's dotted naming, so […] custom-token-behind-relay; a . in the token recreates the unconnected-parent bug this […] UUID tokens, so no default breaks. The actionable fix is fail-fast validation (reject _/. in token) in CellPipe.__init__ or _cell_fqcn.
  2. nvflare/private/fed/authenticator.py:3[…] <owner>_<runtime_id>_(active|passive) is now independently encoded in three modules ([…] cell_pipe.py:66, parsed in identity.py:144 (_get_cell_pipe_alias_owner), re-parsed […] "active"/"passive" literals repeated and equivalence maintained only by comments. […]; a future grammar change applied to one parser desynchronizes mTLS identity resol[…]. A shared parse_cell_pipe_alias() next to FQCN (both files already import it) removes the drift risk.
  3. nvflare/fuel/utils/pipe/cell_pipe.py:50 - The _cell_fqcn comment "its FQCN parent matches its physical cellnet parent" is factually wrong for th[…] fix). The ROOT_SERVER case (every simulator run, VIA_ROOT) names the cell s[…] physically connecting to the server root; it works only via the _try_find_ep fall-through in core_cell.py:1181, which the PR's own test_pipe_cell_reaches_peer_through_server_root […]. […]ner taking the comment at face value could remove the fall-through as dead code […]ption and cross-reference the fall-through as load-bearing.
  4. nvflare/fuel/utils/pipe/cell_pipe.py:6[…] silently fabricates a topology-shaped name (PLAUSIBLE, minor). If conn props ever ca[…] FQCN, the cell would be named site-1.[…]_active while connecting to […] showed the routing fall-through does not rescue relay-connected cells (only root-c[…] […]al bug's symptom would return with a correct-looking name. No in-tree producer […] is defensive-path hardening: a warning log in the else branch would make the deg[…]
  5. nvflare/fuel/utils/pipe/cell_pipe.py:5[…] identical results (PLAUSIBLE, optional). parent_fqcn == FQCN.ROOT_SERVER and the final else both yield prefix = site_name + plain token_mode leaf; collapsing to if not parent_fqcn or paren[…] verified behavior-identical over the whole input space. Trade-off: the separat[…] rationales, so this may be declined as churn.
  6. tests/unit_test/app_common/ccwf/test_c[…] the documented [module_name]_test.py convention (PLAUSIBLE, optional nit). CLA[…]; repo-wide the ratio is 331 *_test.py vs 55 test_*.py, and every other test file t[…], but 9 of 10 pre-existing files in this exact directory use test_*.py, so lo[…] current name. Pytest collects both either way.

Aside (out of diff scope, spotted while t[…]): […]script_runner.py:168 has valid_connect_types = [VIA_CP, VIA_RELAY, VIA_RELAY], a copy-paste duplicate that omits VIA_ROOT, so pipe_connect_type=VIA_ROOT raises ValueError […] tests specifically for the VIA_ROOT topology. Pre-existing, but worth a follow-up.

Address review findings on the alias naming:
- The <owner>_<runtime_id>_(active|passive) grammar was independently
  encoded in cell_pipe.py (build), identity.py (parse) and
  authenticator.py (re-parse). Move it to fqcn.py as
  make_cell_pipe_alias/parse_cell_pipe_alias so the three sites cannot
  drift apart.
- Fail fast on tokens that break the naming: "." always splits the name
  into extra FQCN segments; "_" breaks alias parsing when the cell
  connects through another cell. "_" stays allowed in non-alias forms
  since the simulator uses the "simulate_job" token.
- Correct the _cell_fqcn comment: root-connected pipe cells do NOT have
  a connected FQCN parent; cross-reference the load-bearing routing
  fall-through in CoreCell._try_find_ep and its test.
- Warn when conn props carry no parent FQCN instead of silently
  fabricating a topology-shaped name.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz

nvidianz commented Jul 1, 2026

Collaborator Author

@chesterxgchen Thanks — findings 1-4 are addressed in 67ef2db; 5 and 6 declined with rationale below.

1. Token validation: _cell_fqcn now fails fast. A . is rejected for all connect paths (it always splits the name into extra FQCN segments, recreating the unconnected-parent bug), and an empty or _-containing token is rejected on the through-another-cell path where the token becomes the alias runtime id. One nuance: _ cannot be rejected unconditionally because the simulator uses "simulate_job" as the CellPipe token (nvflare/client/config.py), which is fine in the non-alias forms. This is covered by new unit tests (test_token_with_dot_is_rejected, test_bad_alias_token_is_rejected_behind_relay, test_underscore_token_is_allowed_when_not_aliased).

2. Grammar triplication — the alias grammar now lives in one place: make_cell_pipe_alias / parse_cell_pipe_alias in nvflare/fuel/f3/cellnet/fqcn.py. cell_pipe.py builds with it, and both identity.py and authenticator.py parse with it (the authenticator's owner check reduces to parsed[0] == owner, verified behavior-identical by the existing accept/reject test matrix).

3. Wrong comment — rewritten: the comment now states explicitly that root-connected pipe cells (simulator/root-url) have an unconnected FQCN parent and rely on the _try_find_ep fall-through, and the fall-through comment in core_cell.py now marks itself load-bearing with a cross-reference to test_pipe_cell_reaches_peer_through_server_root.

4. Silent fallback — the missing-parent-FQCN branch now logs a warning so a misconfiguration that produces a correct-looking name is diagnosable.

5. Identical branches — declined as you anticipated: with the warning from (4), the ROOT_SERVER and missing-parent branches are no longer behavior-identical, and they document different rationales.

6. Test file naming — keeping test_client_ctl.py: 9 of the 10 pre-existing files in tests/unit_test/app_common/ccwf/ use the test_*.py form, so the local directory convention wins over the repo-wide one.

Aside (script_runner.py:168 VIA_ROOT omitted from valid_connect_types) — agreed it's a real pre-existing bug; tracking it as a separate follow-up since it's out of this PR's diff scope.

All affected unit suites pass (107 targeted + 164 across cellnet/pipe), style checks clean.

@nvidianz nvidianz requested a review from YuanTingHsieh July 1, 2026 22:09
@chesterxgchen

Collaborator

Review: PR #4835 — Fix CellPipe topology FQCN naming

Overview. The PR fixes the NVBug 6371056 routing regression from #4801 by collapsing the CellPipe token and mode into one leaf segment (site-1.<jobid>_active instead of site-1.<jobid>.active) so a subprocess cell's FQCN parent matches the cell it actually connects to, reintroduces a shared alias grammar (make_cell_pipe_alias/parse_cell_pipe_alias) for relay-connected pipes, and updates mTLS identity resolution and stream auth accordingly. It also bundles an unrelated ccwf fix that reports do_learn_task exceptions to the server. The core routing fix is sound: I verified (and rejected) candidates claiming peer-name divergence between pipe ends and broken relay stream auth. Both ends of a pipe pair always derive the same parent from shared conn props, and the relay auth paths improve under this PR.

Findings (most severe first):

  1. nvflare/fuel/f3/cellnet/identity.py:159 - CONFIRMED: the new <token>_<mode> leaf collides with the alias grammar, breaking mTLS identity resolution for underscore tokens. _cell_fqcn explicitly allows _ in tokens for root/own-CP connections (its own test pins site-1.simulate_job_active), but CellIdentityResolver.resolve() alias-parses the leaf before the parts[0] fallback. parse_cell_pipe_alias("ext_trainer_active") returns owner "ext", and in a default secure deployment the prefix identity map is sparse (provisioning omits identities equal to the name), so the server resolves the cell to expected CN "ext", rejects the legitimate site-1 certificate, and closes the handshake. The verifier executed a repro against PR-head code: the documented 3rd-party integration (FlareAgentWithCellPipe with the docs' own agent_id="ext_trainer", secure_mode=True) can never connect, while the same config resolves to site-1 on base. Job pipes (UUID tokens) and the non-secure simulator are unaffected; the broken shape is precisely the documented secure external-trainer integration.
  2. nvflare/fuel/utils/pipe/cell_pipe.py:68 - CONFIRMED (over-broad guard): the unconditional ValueError on . in tokens breaks dotted agent_ids that worked on base in root-connected topologies. On base, FlareAgentWithCellPipe(agent_id="agent.v2") built site-1.agent.v2.active, which routed (root-connected cells resolve via the direct root agent lookup, not the phantom-parent branch the NVBug hit) and authenticated fine. After this PR the constructor raises. The fail-fast is defensible for CP/relay-connected pipes where a dotted token would recreate the unreachable-parent bug, but it's applied to the ROOT_SERVER branch too, where dotted names worked and would continue to work. Consider scoping the rejection to the branches where it actually breaks naming, or calling it out as an intentional API restriction in release notes. Likelihood is low (job tokens are UUIDs; only user-chosen agent ids are exposed).
  3. nvflare/app_common/ccwf/client_ctl.py:464 - CONFIRMED (cleanup): use self.log_exception(t.fl_ctx, ...) instead of raw self.logger.error(...). t.fl_ctx is in scope, and the file's sibling exception paths all use the fl_ctx-aware FLComponent helpers that prefix identity/run/peer context. That includes _process_learn_request at line 556, which pairs log_exception with the exact same update_status(error=EXECUTION_EXCEPTION) call. Note log_exception appends the traceback itself, and the new unit test asserts on logger.error.call_args, so both would need a small adjustment.
  4. tests/unit_test/app_common/ccwf/test_client_ctl.py:1 - conventions: file name violates the CLAUDE.md test naming rule. CLAUDE.md states test files follow the [module_name]_test.py pattern, so this should be client_ctl_test.py; every other test file this PR touches uses the suffix form. Caveat: 10 of the 11 existing files in that ccwf directory already use the test_* prefix, so local practice contradicts the documented rule; flagging for consistency with the repo doc, your call.

Verified and dismissed (for the record): relay vs CP-behind-relay pipe ends computing different names (both ends always share conn props, so parents always match); the _ token ValueError behind relays (unreachable from every documented/in-tree path — standalone agents always take the ROOT_SERVER branch); VIA_ROOT stream auth for relay-registered clients (rejected identically on base — pre-existing, and this PR improves the supported VIA_CP/VIA_RELAY paths); and the ccwf abort-path error report (in-tree do_learn_task implementations return cleanly on abort, and workflow_done gates any post-abort report from reaching the server).

The one item I'd treat as blocking is finding 1 — it's a hard regression of the documented secure 3rd-party integration path, with an executed repro. Findings 2–4 are act-on-but-minor.
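
The collision behind finding 1 can be reproduced with a small sketch. It is an illustration only: `naive_alias_owner` and `resolve_identity` are hypothetical names, and the gate shown here (alias-parse single-segment FQCNs only, otherwise use the first segment) is a simplification that omits the relay cases handled in the real resolver.

```python
from typing import Optional

MODES = ("active", "passive")


def naive_alias_owner(leaf: str) -> Optional[str]:
    """Right-anchored <owner>_<token>_<mode> parse with no gating at all."""
    parts = leaf.rsplit("_", 2)
    if len(parts) == 3 and all(parts) and parts[2] in MODES:
        return parts[0]
    return None


def resolve_identity(fqcn: str) -> str:
    """Gated resolution: alias-parse only single-segment FQCNs, else use the first segment."""
    parts = fqcn.split(".")
    if len(parts) == 1:
        owner = naive_alias_owner(parts[0])
        if owner:
            return owner
    return parts[0]


# Ungated parsing of a plain leaf fabricates an owner from the token's own underscore.
print(naive_alias_owner("ext_trainer_active"))
# Gating on the FQCN shape keeps the site identity for the same leaf.
print(resolve_identity("site-1.ext_trainer_active"))
# A legacy single-segment alias still resolves to its owner.
print(resolve_identity("site-1_job123_active"))
```

The first call returns `ext`, which is the fabricated owner the finding describes; the gated resolver returns `site-1` for both FQCNs.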

Address review findings on the CellPipe topology naming:

1. CellIdentityResolver only treats a leaf as a CellPipe alias when the
   FQCN is a single segment (legacy sibling alias) or a direct child of
   the local cell (a pipe connected through this cell, e.g. a relay).
   Anywhere else the leaf is a plain <token>_<mode> segment whose token
   may contain "_" (e.g. site-1.ext_trainer_active), and alias-parsing
   it fabricated a wrong owner ("ext"), breaking mTLS identity
   resolution for the documented secure external-trainer integration.

2. _cell_fqcn only rejects "." in tokens for CP/relay-connected pipes,
   where extra FQCN segments would recreate an unconnected parent.
   Root-connected cells route via the root fall-through regardless of
   depth, so dotted agent ids keep working as they did before.

3. ClientSideController._do_learn uses the fl_ctx-aware log_exception
   helper like its sibling exception paths.

4. Rename test_client_ctl.py to client_ctl_test.py to follow the
   documented [module_name]_test.py convention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz

nvidianz commented Jul 2, 2026

Collaborator Author

@chesterxgchen All four findings are addressed in cb0b917:

1. Alias parse breaking underscore tokens (blocking): CellIdentityResolver.resolve now applies the alias interpretation only in the two shapes that actually create an alias: a single-segment FQCN (legacy sibling alias), or a direct child of the local cell (a pipe connected through this cell, e.g. a relay; this is the only place _cell_fqcn emits the alias form, and mTLS handshakes only ever resolve directly-connected peers). Everywhere else the leaf is treated as a plain <token>_<mode> segment, so site-1.ext_trainer_active resolves to site-1 again under a sparse identity map. Added test_identity_resolver_maps_underscore_token_pipe_cell_to_site_identity pinning the documented ext-trainer repro shape. One note: a distant cell (not the connected relay) resolving a relay-alias FQCN now falls back to parts[0] as on main. That only affects the cert-exchange path for relay pipes, which behaves identically on main, so no regression.

2. Over-broad "." rejection — the fail-fast is now scoped to the CP/relay-connected branches, where extra segments would recreate the unconnected-parent bug. The ROOT_SERVER (and missing-parent) branches accept dotted tokens again and name the cell site-1.agent.v2_active, routed via the root fall-through exactly as dotted agent ids were on base. Tests updated: test_token_with_dot_is_allowed_for_root_connection, and the rejection test is parametrized over connected parents only.

3. log_exception: _do_learn now uses self.log_exception(t.fl_ctx, "exception from do_learn_task") like the sibling exception paths (kept the bare except: since log_exception formats the traceback itself); the unit test now expects the two logger.error calls (contextualized message + traceback).

4. Test naming — renamed to client_ctl_test.py per the documented [module_name]_test.py convention, matching the other test files this PR touches; left the pre-existing test_* files in that directory alone to keep the diff scoped.

All touched suites pass locally (identity_binding, cell_pipe, client_ctl, authenticator, core_cell_routing — 109 tests), plus black/isort/flake8 on the changed files.
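
The positional gating described in item 1 can be sketched roughly as below. This is an illustration only, not the code in cb0b917: `parse_alias_leaf` and `resolve_owner` are hypothetical names, and the alias shape is assumed to be `<owner>_<token>_<mode>` parsed from the right.

```python
MODES = ("active", "passive")


def parse_alias_leaf(leaf):
    """Right-anchored parse of '<owner>_<token>_<mode>' (assumed shape).

    Returns (owner, token, mode), or None if the leaf does not fit.
    """
    parts = leaf.rsplit("_", 2)
    if len(parts) != 3:
        return None
    owner, token, mode = parts
    if not owner or not token or mode not in MODES:
        return None
    return owner, token, mode


def resolve_owner(fqcn, local_fqcn):
    """Hypothetical resolver: alias-parse only where an alias can exist."""
    parts = fqcn.split(".")
    is_single_segment = len(parts) == 1
    is_direct_child = len(parts) >= 2 and ".".join(parts[:-1]) == local_fqcn
    if is_single_segment or is_direct_child:
        parsed = parse_alias_leaf(parts[-1])
        if parsed is not None:
            return parsed[0]
    # Everywhere else the leaf is a plain segment: fall back to the root segment.
    return parts[0]
```

Under these assumptions, `resolve_owner("site-1.ext_trainer_active", "server")` yields `site-1`, while a relay resolving its own child `relay-1.site-2_job1_active` yields `site-2`. The same sketch also reproduces the CP-local case where `site-1.simulate_job_active` resolved by `site-1` parses to `simulate`.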

nvidianz and others added 4 commits July 2, 2026 14:52
The configured token (e.g. "{JOB_ID}") can resolve to an empty string
when no job id is available, which named the cell "<site>._<mode>"
(e.g. site-2._active) and made all such pipes on a site collide on a
degenerate name. _cell_fqcn now substitutes a fixed "default" token so
the name stays well-formed (site-2.default_active,
relay-1.site-2_default_active behind a relay), and CellPipe warns when
the fallback is used. Both ends of a pipe pair derive their own and the
peer's name from the same inputs, so they agree on the fallback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen

chesterxgchen commented Jul 3, 2026

Collaborator

PR #4835 — CellPipe topology FQCN naming: three-angle review

Architect: sound, with concerns

The core move is architecturally correct, not a workaround: in CellNet, the FQCN is the routing address, and PR #4801 broke the invariant "address hierarchy ≈ connection topology." Naming the subprocess cell site-1.<jobid>_active makes its FQCN parent the cell it's physically connected to, so routing works through unmodified generic logic; the PR touches core_cell.py only in comments. A plain revert of #4801 would not have fixed the bug (the pre-#4801 flat name is gen-1 and fails the same routing dead-end at core_cell.py:1209), so the "third naming scheme" is justified: it's the first scheme that satisfies routing and mTLS identity simultaneously.

Concerns: the divergence class isn't eliminated: root-connected pipes still rely on a self-described "load-bearing" fall-through, so the invariant is conditional and a future cell type can silently trip the same gen>1 dead-end. The underscore now does double duty (separator inside a segment), contained by a right-anchored grammar and construction-time rejection … seam is untested (below). Mixed-version compatibility is actually good (legacy flat aliases are still accepted in both directions and the …es the released scheme), but none of that reasoning is written down anywhere.

Principal engineer: core fix is correct (verified, not assumed)

The reviewer traced the failing and fixed paths end-to-end, confirmed the core_cell.py diff is comment-only (… to other cell families), algebraically checked that the rewritten _origin_matches_fqcn accepts exactly the same set of names as the old …ed unit tests (pass). The bundled ccwf/client_ctl.py change also fixes a real latent bug: the old self.logger.log(msg) call was missing its level argument (a TypeError inside the except handler), and jobs that hit task-fetch failures previously finished as FINISHED:COMPLETED instead of reporting EXECUTION_EXCEPTION.

Non-blocking findings:
(1) the new hard ValueErrors on dotted/underscored tokens are an upgrade break for custom FlareAgentWithCellPipe agent ids that previously worked; deliberate and loud, but release-note it;
(2) a mixed-version CJ↔subprocess pair (training venv on older nvflare) fails with "peer FQCN mismatch"; inherent to any rename, second rename in a row for this cell, also a release-note item;
(3) minor test gaps: no relay-shape routing test and no direct unit tests for parse_cell_pipe_alias.
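
The latent logging bug noted above comes from the standard library signature `Logger.log(level, msg, ...)`: a call with only a message raises a TypeError. A minimal sketch follows, with hypothetical handler names standing in for the ccwf code (which is not reproduced here):

```python
import logging

logger = logging.getLogger("ccwf_demo")


def broken_handler(msg):
    # Logger.log needs (level, msg); with one argument the message is taken
    # as the level and the required msg parameter is missing.
    logger.log(msg)


def fixed_handler(msg):
    # Passing the level implicitly via logger.error avoids the TypeError.
    logger.error(msg)


def call_raises_type_error(fn, msg):
    """Return True if calling fn(msg) raises TypeError."""
    try:
        fn(msg)
    except TypeError:
        return True
    return False
```

Because the broken call sat inside an exception handler, the TypeError would have replaced the original exception at the point where it was supposed to be reported.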

Security engineer: overall risk LOW

  • Spoofing/impersonation: not weakened. Every enforcement path still requires the victim site's certificate: identity resolution falls back to the FQCN root segment checked against the peer CN, and mismatches close the connection (fail closed). A site-2 cert claiming site-1.<jobid>_active is rejected.
  • Stream auth — set-equivalent. The old and new acceptance rules admit exactly the same strings; no wildcard or broadening. The site vs site_x spoof defense is preserved with negative tests.
  • Cross-job isolation — unchanged. Auth was site-scoped before and after; nothing keyed off exact FQCN parentage was removed.
  • The PR improves posture in three places: alias parsing is now positionally gated (the old code let a cert with CN=ext bind to site-1.ext_trainer_active at any depth — genuinely tightened); ambiguous tokens fail closed at construction; and one shared make/parse grammar replaces a duplicated parser that could drift.

Two low-severity residuals, both independently found by all three reviewers:

  1. CP-local alias ambiguity (untested seam). At the CP itself, a pipe child with an underscore token (site-1.simulate_job_active) alias-parses to a fabricated owner (simulate) — the PE reviewer reproduced this by running the resolver. Failure mode is false rejection (fail closed) and CP-internal hops are typically non-mTLS, so it's parity-level severity — but the new tests only pin local_fqcn="server", not the CP resolving its own child.
  2. Empty-token fallback is now fail-open. Pre-PR an empty token produced an invalid FQCN and failed loudly; now two token-less pipes on one site silently collide on site-1.default_active (warned only). Same-site trust boundary, misconfiguration-triggered — but a deliberate loosening worth a process-unique suffix instead.

Consolidated recommendation

Suggested follow-ups (none blocking): add a local_fqcn= identity-binding test for underscore-token pipe children; make the empty-token fallback unique or raising; add release notes for the new token restrictions and the CJ↔subprocess version-mismatch behavior; put a compat/naming-scheme note next to the grammar in fqcn.py; and consider a connected-ancestor last-resort in _try_find_ep as the deeper routing fix. The bundled client_ctl.py error-reporting fix is correct but separable — fine to keep, worth a mention in the PR description.

… scheme

- An empty CellPipe token now raises ValueError in all branches instead
  of silently falling back to a shared "default" name: an empty token
  cannot uniquely name the cell, and a generated per-process fallback is
  not possible because the two ends of a pipe pair derive each other's
  names independently. This restores the loud failure the pre-PR code
  had (invalid FQCN) with a clearer message.
- Pin the CP-local alias seam in a test: a CP resolving its own
  underscore-token pipe child parses a fabricated owner (fail closed,
  non-mTLS in practice) - documented rather than special-cased since
  the name alone cannot distinguish it from a relay alias.
- Add direct unit tests for make/parse_cell_pipe_alias and relay-shape
  routing tests for <relay>.<site>_<token>_<mode> cells.
- Document the CellPipe naming-scheme history and mixed-version
  behavior next to the alias grammar in fqcn.py.
- Release-note the new token restrictions and the CJ/subprocess
  version-mismatch behavior in flare_280.rst.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz

nvidianz commented Jul 3, 2026

Collaborator Author

@chesterxgchen Thanks for the three-angle review. The suggested follow-ups are addressed in 16916c5:

1. Empty-token fallback → now raising. Of your two options (unique or raising), a process-unique suffix is structurally impossible for CellPipe: the two ends of a pipe pair derive each other's cell names independently from the same (site, token, mode, parent) inputs, so any per-process value breaks the pair's rendezvous. _cell_fqcn now raises ValueError on an empty token in all branches — restoring the loud failure the pre-PR code had (empty tokens produced an invalid FQCN and crashed at cell creation), with a clearer message and no fail-open collision window.

2. CP-local alias seam → pinned in a test. test_identity_resolver_cp_alias_parses_own_underscore_token_child (local_fqcn="site-1") documents the fabricated-owner parse for site-1.simulate_job_active, with a comment explaining why it's fail-closed, non-mTLS in practice, and not distinguishable from a genuine relay alias by name alone.

3. Test gaps — added fqcn_test.py with direct make/parse_cell_pipe_alias tests (round-trip incl. underscore owners, right-anchored parsing, and seven rejection shapes), and two relay-shape routing tests in core_cell_routing_test.py (<relay>.<site>_<token>_<mode> reaching its peer and a server job cell through the connected relay).

4. Compat/naming-scheme note — the grammar block in fqcn.py now documents the three naming schemes (flat pre-2.7, the unreleased #4801 hierarchical form, and the current topology leaf), which mixed-version shapes are still accepted, and the CJ↔subprocess "peer FQCN mismatch" constraint.

5. Release notes: flare_280.rst Compatibility and Migration Notes now covers the new token restrictions (fail-fast ValueErrors, including custom FlareAgentWithCellPipe agent ids) and the CJ↔training-process version-alignment requirement.

6. PR description — updated to call out the separable ccwf _do_learn error-reporting fix.

Deferred as a separate follow-up: the connected-ancestor last-resort in _try_find_ep (the deeper fix that would remove the "load-bearing fall-through" caveat for root-connected pipes). It changes generic routing for all cell types, so I'd rather not fold it into this PR — happy to file it as its own item.

All cellnet/pipe/authenticator/ccwf suites pass (201 tests) plus black/isort/flake8 on the touched files.
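
The token rules and the rendezvous argument from item 1 can be sketched as follows. This assumes the name shapes quoted in this thread (`<site>.<token>_<mode>` for root and own-CP connections, `<relay>.<site>_<token>_<mode>` behind a relay). `build_cell_name`, `pair_names` and the `parent_kind` labels are hypothetical, not the actual `_cell_fqcn` signature.

```python
MODES = ("active", "passive")


def build_cell_name(site, token, mode, parent_kind, relay=None):
    """Hypothetical naming helper; parent_kind is 'root', 'cp' or 'relay'."""
    if mode not in MODES:
        raise ValueError(f"invalid mode: {mode!r}")
    if not token:
        # An empty token cannot uniquely name the cell, so fail loudly.
        raise ValueError("CellPipe token must not be empty")
    if parent_kind in ("cp", "relay") and "." in token:
        # Extra FQCN segments would recreate an unconnected parent.
        raise ValueError("'.' is not allowed in tokens for CP/relay pipes")
    if parent_kind == "relay":
        if not relay:
            raise ValueError("relay name is required behind a relay")
        if "_" in token:
            # Keeps the right-anchored alias parse unambiguous.
            raise ValueError("'_' is not allowed in tokens behind a relay")
        return f"{relay}.{site}_{token}_{mode}"
    return f"{site}.{token}_{mode}"


def pair_names(site, token, parent_kind, relay=None):
    """Names of both pipe ends, derived purely from shared inputs."""
    return tuple(build_cell_name(site, token, m, parent_kind, relay) for m in MODES)
```

Since each end calls something like `pair_names` with the same inputs, both ends compute identical names without exchanging anything. Any per-process value mixed into the name would make the two results differ, which is why a process-unique fallback is not an option here.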

CellPipe leaves are now explicitly marked:
- plain leaf "cellpipe-<token>_<mode>" for root/own-CP connections
- alias leaf "cellpipe-alias-<owner>_<token>_<mode>" behind a relay

This makes the alias grammar unambiguous instead of positionally gated:
a plain leaf whose token contains "_" (e.g. cellpipe-ext_trainer_active)
can never be misread as an alias, so identity resolution recognizes the
marked alias at any depth (restoring owner resolution from distant
cells, e.g. cert exchange through the server) and resolves any plain
pipe leaf to the identity of the cell it is named under - which also
fixes the CP-local seam where an underscore-token pipe child parsed to
a fabricated owner. Stream auth accepts the bare legacy grammar only
for whole-FQCN (pre-2.8 flat) origins and requires the alias marker at
any depth.

Tokens starting with "alias-" are rejected at construction so the plain
namespace cannot collide with the alias namespace.

Verified end-to-end with a POC FedAvg ext-process job (2 clients,
2 rounds, 109MB streamed model): pipes rendezvous under the new names
and the job completes cleanly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@nvidianz

nvidianz commented Jul 3, 2026

Collaborator Author

Follow-up in 100e8e8: CellPipe leaves are now explicitly marked, per review discussion —

  • plain leaf: cellpipe-<token>_<mode> (root / own-CP connections)
  • relay alias: cellpipe-alias-<owner>_<token>_<mode>

This eliminates the alias/plain ambiguity grammatically instead of containing it positionally:

  • Identity resolution recognizes the marked alias at any depth (restoring owner resolution from distant cells, e.g. cert exchange through the server, which the positional gating had traded away), and resolves any plain cellpipe- leaf to the identity of the cell it's named under. That fixes the CP-local seam outright — site-1.cellpipe-simulate_job_active resolved by the CP now yields site-1, not a fabricated owner — so the seam test asserts correct behavior instead of documenting a quirk.
  • Stream auth accepts the bare legacy grammar only for whole-FQCN (pre-2.7-style flat) origins and requires the cellpipe-alias- marker at any depth, with a new negative test pinning the rejection of unmarked alias shapes at depth.
  • Reserved namespace: tokens starting with alias- are rejected at construction so a plain leaf can never collide with the alias namespace.
  • Legacy compat is unchanged (flat 2.7 aliases still accepted in identity resolution and stream auth); the CJ↔subprocess same-scheme requirement is inherent to any rename and already release-noted. Token rules (_/. behind relays, . behind CP, non-empty) are unchanged — the marker disambiguates alias vs plain, not the fields inside the alias.

Verified beyond unit tests (213 passing across cellnet/pipe/authenticator/ccwf, plus direct grammar tests): an end-to-end POC FedAvg run (2 clients, 2 rounds, ext-process ScriptRunner, 109 MB streamed model) completes cleanly with pipes rendezvousing under the new names (site-1.cellpipe-<jobid>_active/passive), zero errors in the server log. Docs updated: naming-scheme history in fqcn.py and release notes now describe the marked forms.
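
For readers following the grammar change, the marked leaves can be sketched as below. The prefixes and the `alias-` reservation come from the description above; the function names are hypothetical and this is not the implementation in fqcn.py. Resolving a plain leaf to the first FQCN segment is a simplification of "the identity of the cell it's named under".

```python
PLAIN_PREFIX = "cellpipe-"
ALIAS_PREFIX = "cellpipe-alias-"
MODES = ("active", "passive")


def make_plain_leaf(token, mode):
    if mode not in MODES:
        raise ValueError(f"invalid mode: {mode!r}")
    if not token or token.startswith("alias-"):
        # 'alias-' is reserved so plain leaves never look like alias leaves.
        raise ValueError(f"invalid token: {token!r}")
    return f"{PLAIN_PREFIX}{token}_{mode}"


def make_alias_leaf(owner, token, mode):
    if mode not in MODES:
        raise ValueError(f"invalid mode: {mode!r}")
    if not owner or not token or "_" in token or "." in token:
        raise ValueError("invalid owner or token for a relay alias")
    return f"{ALIAS_PREFIX}{owner}_{token}_{mode}"


def parse_leaf(leaf):
    """Return ('alias', owner, token, mode), ('plain', None, token, mode) or None."""
    if leaf.startswith(ALIAS_PREFIX):
        parts = leaf[len(ALIAS_PREFIX):].rsplit("_", 2)
        if len(parts) == 3 and all(parts) and parts[2] in MODES:
            return ("alias", *parts)
        return None
    if leaf.startswith(PLAIN_PREFIX):
        parts = leaf[len(PLAIN_PREFIX):].rsplit("_", 1)
        if len(parts) == 2 and all(parts) and parts[1] in MODES:
            return ("plain", None, *parts)
    return None


def resolve_identity(fqcn):
    """Marked alias resolves to its owner at any depth; otherwise the root segment."""
    parts = fqcn.split(".")
    parsed = parse_leaf(parts[-1])
    if parsed is not None and parsed[0] == "alias":
        return parsed[1]
    return parts[0]
```

With the marker in place, no position check is needed: `site-1.cellpipe-simulate_job_active` is a plain leaf wherever it is resolved, and only a leaf carrying `cellpipe-alias-` names a separate owner.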

